#Loading the packages and data as shown on the previous guide "Introduction to R".
packages.to.load <- c("plyr", "dplyr" ,"DT", "ggplot2", "plotly", "RODBC")
invisible(lapply(packages.to.load, library, character.only=TRUE))
load("C:/Users/shregmi/Documents/Current Projects/R Training/Example Datasets/Auto.rda")
write.csv(Auto, file="Autodata.csv")
auto.data <- read.csv(file ="C:/Users/shregmi/Documents/Current Projects/R Training/Autodata.csv", header = TRUE)
This section will cover a couple ways to shape data into something that is easy to work with.
For very simple filters and selection of certain columns from a data frame, you can use the $ operator. Here, we take one column from our auto.data data frame and assign it to a new object. You can use the class() function to see what kind of object it is.
JustMPG <- matrix(auto.data$mpg)
class(JustMPG)
## [1] "matrix"
We can also pull multiple columns pretty easily.
MPGandWeight <- data.frame(auto.data$mpg, auto.data$weight)
head(MPGandWeight)
## auto.data.mpg auto.data.weight
## 1 18 3504
## 2 15 3693
## 3 18 3436
## 4 16 3433
## 5 17 3449
## 6 15 4341
The package dplyr is a heavily used library for data cleaning and preparation (one of R’s main strong suits). Here is a little cheat sheet covering some dplyr and tidyr features: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
There are 5 main functions or “verbs” used in dplyr: Select, Filter, Mutate, Arrange, and Summarize. Many of these align with the main SQL commands: Select, Where, Group By, etc…
Let’s take a subset of columns from the auto.data dataset.
auto.subset <- select(auto.data, name, mpg, cylinders, weight)
head(auto.subset)
## name mpg cylinders weight
## 1 chevrolet chevelle malibu 18 8 3504
## 2 buick skylark 320 15 8 3693
## 3 plymouth satellite 18 8 3436
## 4 amc rebel sst 16 8 3433
## 5 ford torino 17 8 3449
## 6 ford galaxie 500 15 8 4341
We can also specify which columns to exclude using the “-” operator.
auto.subset2 <- select(auto.data, -horsepower, -year)
head(auto.subset2)
## X mpg cylinders displacement weight acceleration origin
## 1 1 18 8 307 3504 12.0 1
## 2 2 15 8 350 3693 11.5 1
## 3 3 18 8 318 3436 11.0 1
## 4 4 16 8 304 3433 12.0 1
## 5 5 17 8 302 3449 10.5 1
## 6 6 15 8 429 4341 10.0 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
After selecting your data of interest, you can pass filters in the same way as you would a “where” clause in SQL.
only.4cyl <- filter(auto.subset, cylinders==4)
## Warning: package 'bindrcpp' was built under R version 3.3.3
head(only.4cyl)
## name mpg cylinders weight
## 1 toyota corona mark ii 24 4 2372
## 2 datsun pl510 27 4 2130
## 3 volkswagen 1131 deluxe sedan 26 4 1835
## 4 peugeot 504 25 4 2672
## 5 audi 100 ls 24 4 2430
## 6 saab 99e 25 4 2375
You can add multiple arguments to the filter function.
multi.filter <- filter(auto.subset, cylinders %in% c(4,6), weight<3000)
head(multi.filter)
## name mpg cylinders weight
## 1 toyota corona mark ii 24 4 2372
## 2 plymouth duster 22 6 2833
## 3 amc hornet 18 6 2774
## 4 ford maverick 21 6 2587
## 5 datsun pl510 27 4 2130
## 6 volkswagen 1131 deluxe sedan 26 4 1835
We can filter strings as well.
filter.name <- filter(auto.subset, cylinders==4, weight<2000, grepl("toyota|volkswagen", name))
head(filter.name)
## name mpg cylinders weight
## 1 volkswagen 1131 deluxe sedan 26 4 1835
## 2 toyota corolla 1200 31 4 1773
## 3 volkswagen model 111 27 4 1834
## 4 volkswagen super beetle 26 4 1950
## 5 toyota corolla 1200 32 4 1836
## 6 volkswagen dasher 26 4 1963
Using mutate, we can create new columns. Additionally, we can chain the previous “verbs” together using the “%>%” operator (called pipe operator). Note that the dplyr package must be loaded in order to use this operator.
add.col <- auto.data %>%
select(name, mpg, cylinders, weight) %>%
filter(weight>1800) %>%
mutate(weightpercyl <- weight/cylinders)
head(add.col)
## name mpg cylinders weight
## 1 chevrolet chevelle malibu 18 8 3504
## 2 buick skylark 320 15 8 3693
## 3 plymouth satellite 18 8 3436
## 4 amc rebel sst 16 8 3433
## 5 ford torino 17 8 3449
## 6 ford galaxie 500 15 8 4341
## weightpercyl <- weight/cylinders
## 1 438.000
## 2 461.625
## 3 429.500
## 4 429.125
## 5 431.125
## 6 542.625
This is equivalent to an “order by” statement in SQL. Here, we sort the previous output by “weightpercyl” in descending order using the “-” operator.
add.col.sorted <- auto.data %>%
select(name, mpg, cylinders, weight) %>%
filter(weight>1800) %>%
mutate(weightpercyl = weight/cylinders) %>%
arrange(-weightpercyl)
head(add.col.sorted)
## name mpg cylinders weight weightpercyl
## 1 mazda rx-4 21.5 3 2720 906.6667
## 2 peugeot 504 19.0 4 3270 817.5000
## 3 mercedes-benz 240d 30.0 4 3250 812.5000
## 4 peugeot 505s turbo diesel 28.1 4 3230 807.5000
## 5 mazda rx-7 gs 23.7 3 2420 806.6667
## 6 peugeot 504 27.2 4 3190 797.5000
summarized.auto <- auto.data %>%
select(name, mpg, cylinders, weight) %>%
filter(weight>1800) %>%
mutate(weightpercyl = weight/cylinders) %>%
summarise(avg_mpg = mean(mpg),
min_weight = min(weight),
median_weightpercyl = median(weightpercyl))
summarized.auto
## avg_mpg min_weight median_weightpercyl
## 1 23.18903 1825 544.1667
We can pass a group_by() clause as well.
summarized.auto2 <- auto.data %>%
select(name, mpg, cylinders, weight) %>%
filter(weight>1800) %>%
mutate(weightpercyl = weight/cylinders) %>%
group_by(cylinders) %>%
summarise(avg_mpg = mean(mpg),
avg_weight = mean(weight),
avg_weightpercyl = mean(weightpercyl))
summarized.auto2
## # A tibble: 5 x 4
## cylinders avg_mpg avg_weight avg_weightpercyl
## <int> <dbl> <dbl> <dbl>
## 1 3 20.6 2398. 800.
## 2 4 29.0 2331. 583.
## 3 5 27.4 3103. 621.
## 4 6 20.0 3202. 534.
## 5 8 15.0 4115. 514.
This section will cover a few useful packages and methods for visualization in an R Markdown document.
When using print() or head() functions to show the contents of a data frame, the output looks very ugly. There are a huge number of packages that are specifically for displaying tables in an aesthetically pleasing way. The “DT” or Data Tables package is one of them.
datatable(auto.data)
This is the default way to display a data frame with this package with no arguments provided except for the object that is to be shown.
There are a huge number of arguments you can add to make your reports look nicer and to add functionality. BONUS: You’ll also write the smallest piece of JavaScript code in this as well.
datatable(auto.data, extensions = 'Buttons', options = list(
dom = 'Bfrtip',
buttons= c('copy', 'excel'),
pageLength = 5,
initComplete = JS("
function(settings, json) {
$(this.api().table().header()).css({
'background-color': '#232D69',
'color': '#fff'
});
}")
), rownames = FALSE)
Cheat sheet for plotly: https://images.plot.ly/plotly-documentation/images/r_cheat_sheet.pdf.
Another great resource: https://plot.ly/r/
plot <- plot_ly(x = auto.data$weight, type="histogram")
plot
In plotly, you can add multiple sets of traces or bars to the same plot with the use of pipe operators. You can also name each of the traces/bars so that the result is clear.
plot2 <- plot_ly(data =auto.data, alpha = 0.5) %>% #adjusting color transparency
add_histogram(x = auto.data$weight[auto.data$origin == 1], name="American") %>%
add_histogram(x = auto.data$weight[auto.data$origin == 2], name="European") %>%
add_histogram(x = auto.data$weight[auto.data$origin == 3], name="Japanese") %>%
layout(barmode = "overlay")
plot2
Here is an example of a scatterplot generated by plotly. This has many more features and much more functionality than the plot we first generated with the R Base plot() function.
auto.data$OriginName[which(auto.data$origin == 1)] = "American"
auto.data$OriginName[which(auto.data$origin == 2)] = "European"
auto.data$OriginName[which(auto.data$origin == 3)] = "Japanese"
plot3 <- plot_ly(data = auto.data, x = ~weight, y = ~mpg, color = ~OriginName, type ="scatter", mode="markers", text= ~paste("Car name: ", name ,
"</br> Year: ", year,
"</br> Cylinders: ", cylinders))
plot3
***
Cheat sheet for ggplot2: https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf